Many statistical tools and techniques used in data analysis are based on probability, which measures, on a scale from 0 to 1, the likelihood that an event occurs (0: the event never occurs; 1: the event always occurs).
Variables in the columns of a data set can be thought of as random variables: their values vary due to chance.
The distribution of the outcomes (values) of a random variable is described using a probability distribution (function). In statistics, there are several common probability distributions, corresponding to various "shapes". The ones most commonly used to model real-life random events are the Uniform, Normal, Binomial, Exponential, Poisson, and Lognormal distributions.
This notebook presents several common probability distributions and how to work with them in Python.
The uniform distribution is a probability distribution where each value within a certain range is equally likely to occur and values outside of the range never occur.
In [2]:
# Uncomment next command if you need to install a missing module
#!pip install statsmodels
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
%matplotlib inline
In [12]:
uniform_data = stats.uniform.rvs(size = 10000, # Generate 10000 numbers
loc = 0, # From 0
scale = 20) # To 20
pd.DataFrame(uniform_data).plot(kind = "density", # Plot the distribution
figsize = (9,9),
xlim = (-5,25))
Out[12]:
The plot above is an approximation of the theoretical uniform distribution because it is based on a sample of observations: we generated 10,000 data points from a uniform distribution spanning the range 0 to 20.
In the density plot, we see that the density of sample data is almost level, i.e., all values have the same probability of occurring.
Note: the area under a probability density curve is always equal to 1.
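We can check this numerically for the uniform distribution above: the density is flat at 1/20 over the range 0 to 20, so the total area is 20 * (1/20) = 1.

```python
import scipy.stats as stats

# The density of a uniform distribution on [0, 20] is flat at 1/20
height = stats.uniform.pdf(10, loc=0, scale=20)
print(height)  # 0.05

# Width of the range times the density height gives the total area
area = 20 * height
print(area)  # 1.0
```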
More useful scipy functions:
stats.distribution.rvs() generates random numbers from the specified distribution. The arguments to rvs() vary depending on the type of distribution considered, e.g., the starting and ending points and the sample size for a uniform distribution.
stats.distribution.cdf() returns the probability that an observation drawn from a distribution is below a specified value (a.k.a., the cumulative distribution function) calculated as the area under the distribution's density curve to the left of the specified value (on the x axis). For example, in the uniform distribution above, there is a 25% chance that an observation will be in the range 0 to 5 and a 75% chance it will fall in the range 5 to 20. We can confirm this with cdf():
In [13]:
stats.uniform.cdf(x = 5, # quantile to check
loc = 0, # start value of the distribution
scale = 20) # end value of the distribution
Out[13]:
stats.distribution.ppf() is the inverse of cdf(): given a probability q, it returns the value (quantile) below which that proportion of observations falls.
In [15]:
stats.uniform.ppf( q = 0.2, # Probability cutoff
loc = 0, # start value of the distribution
scale = 20) # end value of the distribution
Out[15]:
Probability distribution functions in scipy also support median(), mean(), var() and std().
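For example, for the uniform distribution on the range 0 to 20:

```python
import scipy.stats as stats

# Summary statistics of the uniform distribution on [0, 20]
print(stats.uniform.mean(loc=0, scale=20))    # 10.0: the midpoint of the range
print(stats.uniform.median(loc=0, scale=20))  # 10.0
print(stats.uniform.var(loc=0, scale=20))     # (20**2)/12, about 33.33
print(stats.uniform.std(loc=0, scale=20))     # square root of the variance
```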
In [19]:
for x in range(-2, 24, 4):
    print("Density at x value " + str(x))
    print( stats.uniform.pdf(x, loc = 0, scale = 20) )
To generate random real numbers in a range with equal probability, one can draw numbers from a uniform distribution using the stats.distribution.rvs() function described above.
Python also has a library called "random" which comes equipped with various operations that involve randomization.
In [24]:
import random
print(random.randint(0,10)) # Get a random integer in the specified range
print(random.choice([2,4,6,9])) # Get a random element from a sequence
In [26]:
print(random.random()) # Get a real number between 0 and 1
In [25]:
print(random.uniform(0,10)) # Get a real in the specified range
Regardless of the method used to generate random numbers, the result of a random process can differ from one run to the next. If this is not desired, e.g., when results need to be reproduced exactly, one can ensure the output is the same each time by setting the random number generator's seed to a fixed value. random.seed() allows us to set the random number generator's seed.
Notice that we generate the exact same numbers with both calls to random.uniform() because the same seed is set before each call. This reproducibility illustrates the fact that these random numbers aren't truly random, but rather "pseudorandom".
Many Python library functions that use randomness have an optional random seed argument built in, e.g., the rvs() function has an argument random_state, which allows us to set the seed.
Note: there is a separate internal seed from the numpy library, which can be set with numpy.random.seed() when using functions from numpy and libraries built on top of it such as pandas and scipy.
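A quick sketch of both approaches: passing random_state to rvs(), and seeding numpy's global generator before drawing.

```python
import numpy as np
import scipy.stats as stats

# Same random_state -> identical samples from rvs()
a = stats.uniform.rvs(size=3, loc=0, scale=20, random_state=42)
b = stats.uniform.rvs(size=3, loc=0, scale=20, random_state=42)
print(np.array_equal(a, b))  # True

# Seeding numpy's global generator has the same effect
np.random.seed(42)
c = stats.uniform.rvs(size=3, loc=0, scale=20)
np.random.seed(42)
d = stats.uniform.rvs(size=3, loc=0, scale=20)
print(np.array_equal(c, d))  # True
```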
In [28]:
random.seed(10) # Set the seed to 10
print([random.uniform(0,20) for x in range(4)])
random.seed(10) # Set the seed to the same value
print([random.uniform(0,20) for x in range(4)])
The normal (or Gaussian) distribution is a continuous probability distribution with a bell-shaped curve and is characterized by its center point (mean) and spread (standard deviation). Most observations from a normal distribution lie close to the mean, i.e., about 68% of the data lies within 1 standard deviation of the mean, 95% lies within 2 standard deviations and 99.7% within 3 standard deviations.
Note that many common statistical tests assume distributions are normal. (In the scipy module the normal distribution is referred to as norm.)
In [29]:
prob_under_minus1 = stats.norm.cdf(x = -1,
loc = 0,
scale = 1)
prob_over_1 = 1 - stats.norm.cdf(x= 1,
loc = 0,
scale= 1)
between_prob = 1 - (prob_under_minus1 + prob_over_1)
print(prob_under_minus1, prob_over_1, between_prob)
The figures above show that approximately 16% of the data in a normal distribution with mean 0 and standard deviation 1 is below -1, 16% is above 1 and 68% is between -1 and 1. Let's plot the normal distribution:
In [48]:
# Plot normal distribution areas*
#plt.rcParams["figure.figsize"] = (9,9)
plt.fill_between(x=np.arange(-4,-1,0.01),
y1= stats.norm.pdf(np.arange(-4,-1,0.01)) ,
facecolor='red',
alpha=0.35)
plt.fill_between(x=np.arange(1,4,0.01),
y1= stats.norm.pdf(np.arange(1,4,0.01)) ,
facecolor='red',
alpha=0.35)
plt.fill_between(x=np.arange(-1,1,0.01),
y1= stats.norm.pdf(np.arange(-1,1,0.01)) ,
facecolor='blue',
alpha=0.35)
plt.text(x=-1.8, y=0.03, s= round(prob_under_minus1,3))
plt.text(x=-0.2, y=0.1, s= round(between_prob,3))
plt.text(x=1.4, y=0.03, s= round(prob_over_1,3))
Out[48]:
The plot above shows the bell shape of the normal distribution as well as the area below, above and within one standard deviation of the mean. One can also check the quantiles of a normal distribution with stats.norm.ppf():
In [32]:
print( stats.norm.ppf(q=0.025) ) # Find the quantile for the 2.5% cutoff
print( stats.norm.ppf(q=0.975) ) # Find the quantile for the 97.5% cutoff
The figures above show that approximately 5% of the data is further than 2 standard deviations from the mean.
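The full 68-95-99.7 rule mentioned earlier can be checked directly with cdf():

```python
import scipy.stats as stats

# Fraction of a standard normal within k standard deviations of the mean
for k in (1, 2, 3):
    within = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print("Within", k, "standard deviation(s):", round(within, 4))
```

The three printed values are approximately 0.6827, 0.9545 and 0.9973.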
The binomial distribution is a discrete probability distribution that models the outcomes of a given number of random trials of some experiment or event. The binomial distribution has two parameters: the probability of success in a trial and the number of trials. It represents the likelihood of achieving a given number of successes in n trials of an experiment. We could model flipping a fair coin 10 times with a binomial distribution where the number of trials is set to 10 and the probability of success is set to 0.5. The scipy name for the binomial is binom.
In [49]:
fair_coin_flips = stats.binom.rvs(n=10, # Number of flips per trial
p=0.5, # Success probability
size=10000) # Number of trials
print( pd.crosstab(index="counts", columns= fair_coin_flips))
pd.DataFrame(fair_coin_flips).hist(range=(-0.5,10.5), bins=11)
Out[49]:
The binomial distribution is discrete and can be summarized with a frequency table (and a histogram). The histogram shows us that a binomial distribution with a 50% probability of success is roughly symmetric, but this is not the case in general.
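The symmetry is consistent with the distribution's summary statistics: for a binomial with n trials and success probability p, the mean is n*p and the variance is n*p*(1-p).

```python
import scipy.stats as stats

# Binomial summary statistics: mean = n*p, variance = n*p*(1-p)
print(stats.binom.mean(n=10, p=0.5))  # 5.0
print(stats.binom.var(n=10, p=0.5))   # 2.5
print(stats.binom.mean(n=10, p=0.8))  # 8.0: a biased coin shifts the center
```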
In [34]:
biased_coin_flips = stats.binom.rvs(n=10, # Number of flips per trial
p=0.8, # Success probability
size=10000) # Number of trials
# Print table of counts
print( pd.crosstab(index="counts", columns= biased_coin_flips))
# Plot histogram
pd.DataFrame(biased_coin_flips).hist(range=(-0.5,10.5), bins=11)
Out[34]:
In [38]:
#cdf() checks the probability of achieving a number of successes within a certain range
print(stats.binom.cdf(k=5, # Probability of k = 5 successes or less
n=10, # With 10 flips
p=0.8))# And success probability 0.8
print(1 - stats.binom.cdf(k=8, # Probability of k = 9 successes or more
n=10, # With 10 flips
p=0.8)) # And success probability 0.8
For continuous probability distributions, we can use pdf() to check the probability density at a given x value. For discrete distributions like the binomial, we can use stats.distribution.pmf() (the probability mass function) to check the proportion of observations at a given number of successes k.
In [39]:
print(stats.binom.pmf(k=5, # Probability of k = 5 successes
n=10, # With 10 flips
p=0.5)) # And success probability 0.5
print(stats.binom.pmf(k=8, # Probability of k = 8 successes
n=10, # With 10 flips
p=0.8)) # And success probability 0.8
The geometric distribution is discrete and models the number of trials it takes to achieve a success in repeated experiments with a given probability of success. The exponential distribution is continuous and models the amount of time before an event occurs given a certain occurrence rate. The scipy nickname for the geometric distribution is "geom". Below we model the number of trials it takes to get a success (heads) when flipping a fair coin:
In [40]:
np.random.seed(12) # Seed numpy's generator, which scipy's rvs() uses
flips_till_heads = stats.geom.rvs(size=10000, # Generate geometric data
p=0.5) # With success prob 0.5
# Print table of counts
print( pd.crosstab(index="counts", columns= flips_till_heads))
# Plot histogram
pd.DataFrame(flips_till_heads).hist(range=(-0.5,max(flips_till_heads)+0.5)
, bins=max(flips_till_heads)+1)
Out[40]:
In the 10,000 trials generated, the longest it took to get a heads was 13 flips. We can use cdf() to check the probability of needing more than 3 flips to get a success:
In [50]:
first_three = stats.geom.cdf(k=3, # Prob of success within first 3 flips
p=0.5)
1 - first_three
Out[50]:
In [42]:
#Use pmf() to check the probability of needing a specific number of flips before a success
stats.geom.pmf(k=2, # Prob of needing exactly 2 flips to get the first success
p=0.5)
Out[42]:
In [51]:
# Get the probability of waiting more than 1 time unit before a success
prob_1 = stats.expon.cdf(x=1,
scale=1) # Scale = 1/arrival rate (here rate = 1)
1 - prob_1
Out[51]:
Note: The average arrival time for the exponential distribution is equal to 1/arrival_rate. Let's plot this exponential distribution:
In [44]:
plt.fill_between(x=np.arange(0,1,0.01),
y1= stats.expon.pdf(np.arange(0,1,0.01)) ,
facecolor='blue',
alpha=0.35)
plt.fill_between(x=np.arange(1,7,0.01),
y1= stats.expon.pdf(np.arange(1,7,0.01)) ,
facecolor='red',
alpha=0.35)
plt.text(x=0.3, y=0.2, s= round(prob_1,3))
plt.text(x=1.5, y=0.08, s= round(1 - prob_1,3))
Out[44]:
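Note that scipy parameterizes the exponential distribution by scale = 1/arrival_rate, so the mean waiting time equals the scale:

```python
import scipy.stats as stats

# The mean waiting time of an exponential distribution equals its scale
print(stats.expon.mean(scale=1))     # 1.0 (arrival rate 1)
print(stats.expon.mean(scale=0.25))  # 0.25 (arrival rate 4)

# Probability of waiting more than one time unit at rate 1 is e**-1
print(1 - stats.expon.cdf(x=1, scale=1))  # about 0.368
```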
The Poisson distribution models the probability of seeing a certain number of successes within a time interval, where the time between successes is modeled by an exponential distribution. The Poisson distribution can be used to model the number of arrivals a hospital can expect in an hour's time, the number of emails one can expect to receive in a day, etc. The scipy name for the Poisson distribution is "poisson". Below is a plot of data from a Poisson distribution with an arrival rate of 1 per time unit:
In [45]:
np.random.seed(12) # Seed numpy's generator, which scipy's rvs() uses
arrival_rate_1 = stats.poisson.rvs(size=10000, # Generate Poisson data
mu=1 ) # Average arrivals per time unit: 1
# Print table of counts
print( pd.crosstab(index="counts", columns= arrival_rate_1))
# Plot histogram
pd.DataFrame(arrival_rate_1).hist(range=(-0.5,max(arrival_rate_1)+0.5)
, bins=max(arrival_rate_1)+1)
Out[45]:
The histogram shows that when arrivals are relatively infrequent, it is rare to see more than a couple of arrivals in each time period.
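We can quantify this with cdf(): at an arrival rate of 1, the probability of seeing more than 2 arrivals in a time unit is only about 8%.

```python
import scipy.stats as stats

# With arrival rate 1, how likely are more than 2 arrivals in a time unit?
prob_more_than_2 = 1 - stats.poisson.cdf(k=2, mu=1)
print(round(prob_more_than_2, 4))  # 0.0803
```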
In [46]:
np.random.seed(12) # Seed numpy's generator, which scipy's rvs() uses
arrival_rate_10 = stats.poisson.rvs(size=10000, # Generate Poisson data
mu=10 ) # Average arrivals per time unit: 10
# Print table of counts
print( pd.crosstab(index="counts", columns= arrival_rate_10))
# Plot histogram
pd.DataFrame(arrival_rate_10).hist(range=(-0.5,max(arrival_rate_10)+0.5)
, bins=max(arrival_rate_10)+1)
Out[46]:
We can use cdf() to check the probability of achieving more (or less) than a certain number of successes and pmf() to check the probability of obtaining a specific number of successes:
In [47]:
print(stats.poisson.cdf(k=5, # Check the probability of 5 arrivals or less
mu=10)) # With arrival rate 10
print(stats.poisson.pmf(k=10, # Check the prob of exactly 10 arrivals
mu=10)) # With arrival rate 10
Material adapted from: http://hamelg.blogspot.ca/2015/11/python-for-data-analysis-part-22.html